PCA outperforms popular hidden variable inference methods for molecular QTL mapping

Background Estimating and accounting for hidden variables is widely practiced as an important step in molecular quantitative trait locus (molecular QTL, henceforth “QTL”) analysis for improving the power of QTL identification. However, few benchmark studies have been performed to evaluate the efficacy of the various methods developed for this purpose. Results Here we benchmark popular hidden variable inference methods including surrogate variable analysis (SVA), probabilistic estimation of expression residuals (PEER), and hidden covariates with prior (HCP) against principal component analysis (PCA)—a well-established dimension reduction and factor discovery method—via 362 synthetic and 110 real data sets. We show that PCA not only underlies the statistical methodology behind the popular methods but is also orders of magnitude faster, better-performing, and much easier to interpret and use. Conclusions To help researchers use PCA in their QTL analysis, we provide an R package PCAForQTL along with a detailed guide, both of which are freely available at https://github.com/heatherjzhou/PCAForQTL. We believe that using PCA rather than SVA, PEER, or HCP will substantially improve and simplify hidden variable inference in QTL mapping as well as increase the transparency and reproducibility of QTL research. Supplementary Information The online version contains supplementary material available at 10.1186/s13059-022-02761-4.

In addition, the simulation study in the original PEER publication [1] is imperfect in that the description of the data simulation and analysis is vague and inconsistent, and there is no reproducible code. In contrast, we describe our data simulation and analysis in detail (Sections S2 to S4) and provide the code we use to generate the results (see Availability of data and materials).

S2.1 Data simulation
In Simulation Design 1, we perform 10 replicates of the same experiment, where in each replicate, we follow the data simulation in Stegle et al.
In the first step, we simulate Y beforeDSE , the gene expression matrix before downstream effect, based on where ⊙ denotes element-wise multiplication. Specifically, in the genotype component, we have • S: genotype matrix. Each entry is drawn independently from Binom (2, prob = 0.4). That is, the target MAF is 0.4. In this work, all random sampling is independent unless otherwise specified. • I 1 : effect indicator matrix. Each entry is drawn from Ber (0.01).
In the second step, we simulate Y DSE , the gene expression matrix due to the downstream effect of genes, based on where we have • I 3 : effect indicator matrix. To simulate I 3 , we start with a zero matrix. Then, we randomly choose three rows corresponding to genes with at least one cis-QTL (Section S2.2). For each of these three rows, we randomly assign 30 entries corresponding to genes other than the current gene in consideration (avoiding self-loops) to be one. • B 3 : effect size matrix. Each entry is drawn from N (8, var = 0.8) for "strong downstream effects" [1].
As we see in Section S2.2, the downstream effect of genes induces trans-QTL relations.
In the third and last step, we define Y , the final, observed gene expression matrix, as

S2.2 Definition of truth
In a simulated data set, the cis-QTL relations are encoded in the q × p binary matrix I 1 . The l j-th entry being one means that SNP l and gene j form a cis-QTL pair (i.e., SNP l is a cis-QTL for gene j).
The trans-QTL relations are encoded in J, also a q × p binary matrix. J is defined based on I 1 and I 3 . Specifically, SNP l and gene j form a trans-QTL pair if and only if SNP l is a cis-QTL for gene j ′ and gene j ′ has downstream effect on gene j, j ′ ̸ = j.
The overall truth is encoded in 1 (I 1 + J) ≥ 1 , again a q × p binary matrix. We use this matrix as the truth when calculating AUPRCs. The l j-th entry being one means that SNP l and gene j form a cis-QTL or trans-QTL pair (or both).  Table S1: Summary of the main differences between Simulation Design 1 and Simulation Design 2. Highlighted in blue are the major data simulation limitations (Section S1) of Simulation Design 1, all of which we address in Simulation Design 2.
Replicate     Figure S1: Comparison of all 15 methods (Table 1) in terms of power and adjusted R 2 measures in Simulation Design 1 (the height of each bar represents the average across simulated data sets) and an example scree plot. a, b PCA is more powerful than SVA, PEER, and HCP both when we consider all QTL relations (a) and when we focus on trans-QTL relations (b). Binary decisions are made based on p-values using the Benjamini-Hochberg (BH) procedure and a target false discovery rate of 0.05. c, d, e PCA performs the best in terms of concordance score. PEER with a large K (dark orange bars) performs well in terms of adjusted R 2 but less well in terms of reverse adjusted R 2 . f An example scree plot that unambiguously suggests the true number of hidden covariates, seven in this case, as the reasonable number of PCs to choose (the y-axis represents the proportion of variance explained).

S3.1 Data simulation
In Simulation Design 2, we use real genotype data from GTEx [2], focus on cis-QTL detection, and carefully control the genotype effects and covariate effects in 176 experiments with two replicates per experiment. This simulation design takes inspiration from and is loosely based on Wang et al. [3].
We begin by obtaining SArray, the n × q × p genotype array that remains constant throughout Simulation Design 2. SArray[ , , j], an n × q matrix, is the genotype matrix for the q local common SNPs for gene j. We obtain SArray with the following steps: Therefore, we have a total of 2 × 4 × (7 + 6 + 5 + 4) = 8 × 22 = 176 experiments covering typical scenarios in QTL studies [3]. Following Wang et al. [3], we use the term "effect SNPs" to refer to SNPs that have a nonzero cis effect on a given gene.
Given numOfEffectSNPs, numOfCovariates, PVEGenotype, and PVECovariates, we simulate each data set based on where Y is the gene expression matrix, and ⊗ is defined as Specifically, in the genotype component, we have • SArray: genotype array. SArray[ , , j], an n × q matrix, is the genotype matrix for the q local common SNPs for gene j (see above). • I: effect indicator matrix.
-If numOfEffectSNPs = 1, then for each column, we randomly assign one entry to be one while keeping the other entries zero. -If numOfEffectSNPs = random, then each entry of I is drawn from Ber (1/q). This means that for each gene, the number of effect SNPs is drawn from Binom (q, prob = 1/q). This binomial distribution approximates the empirical distribution of the number of independent cis-eQTLs per gene in GTEx data [2] well ( Figure S2). • B 1 : effect size matrix. Each entry is drawn from N (0, 1).
In the covariate component, we have • X: covariate matrix. Each entry is drawn from N (0, 1). As in Simulation Design 1, the first K 1 columns are designated as the known covariates (X 1 , n × K 1 ), and the last K 2 columns are designated as the hidden covariates (X 2 , n × K 2 ). • B 2 : effect size matrix. Each entry is drawn from N (0, 1) and scaled (see below).
Lastly, in the noise component, we have • E: noise matrix. Each entry is drawn from N (0, 1) and scaled (see below).
The scaling for B 2 and E is to ensure that PVEGenotype and PVECovariates are as desired.
and separately scale (E) j so that If Var S j (IB 1 ) j = 0 (which is the case when (IB 1 ) j is a zero vector, i.e., when gene j has zero effect SNPs), then we only scale (E) j so that

S3.2 Definition of truth
In a simulated data set, I is a q × p binary matrix. The l j-th entry being one means that the lth local common SNP for gene j is an effect SNP for gene j. However, due to LD, the expression level of a gene may be strongly associated with SNPs other than its effect SNPs.
Therefore, we define I cor , also a q × p binary matrix, based on SArray and I and use it as the truth when calculating AUPRCs. The l j-th entry of I cor is defined as one if and only if the lth local common SNP for gene j is highly correlated with any of gene j's effect SNPs (correlation ≥ 0.9 in absolute value).  Figure S2: In Simulation Design 2, we find that Binom (1000, prob = 1/1000) approximates the empirical distribution of the number of independent cis-eQTLs per gene in GTEx data [2] well. a Given a tissue type, which corresponds to a sample size, we plot the proportion of genes with 0, 1, 2, 3, 4, or 5 or more independent cis-eQTLs (the proportions add up to one; data from GTEx [2]). We find that the proportions stabilize once the sample size reaches about 517 (dashed line). b For the eight tissue types with sample size ≥ 517, we take the average proportion of genes with 0 independent cis-eQTLs, 1 independent cis-eQTL, etc. and plot them in the blue bars. The green bars represent the probability mass function of Binom (1000, prob = 1/1000) (with the tail probabilities combined together).  Figure S3: This figure shows how we select a few representative methods from the 15 methods for detailed comparison in Simulation Design 2 (a, b, c) and a dataset-by-dataset comparison of the selected representative methods (d). The x-axis and y-axis both represent AUPRCs of different methods. Each scatter plot contains 352 points, each of which corresponds to a simulated data set in Simulation Design 2. The number on the upper-left corner of each scatter plot represents the proportion of points that satisfy y > 1.02 x, and the number on the lower-right corner represents the proportion of points that satisfy x > 1.02 y, where x and y denote the coordinates of each point. a The two PCA methods perform almost identically, so for simplicity, we select PCA direct screeK. The two SVA methods perform almost identically as well, so we select SVA BE. b Whether the known covariates are inputted when PEER is run has little effect on the AUPRC. c When we use the true K, the factor approach outperforms the residual approach, but when we use a large K, the residual approach outperforms the factor approach. Therefore, we select PEER withCov trueK factors and PEER withCov largeK residuals as the representative PEER methods. d Among the selected representative methods, PCA outperforms SVA, PEER, and HCP in terms of AUPRC in 11% to 88% of the simulated data sets and underperforms them in close to 0% of the simulated data sets.  Figure S4: Detailed adjusted R 2 , reverse adjusted R 2 , and concordance score comparison of the selected representative methods (Table 1) in Simulation Design 2. Each point represents the average across simulated data sets. PCA performs the best in all three regards. PEER with a large K (dark orange line) performs well in terms of adjusted R 2 but falls short in terms of reverse adjusted R 2 .

S4 Compared methods
We compare the runtime and performance of 15 methods based on simulation studies, including Ideal, Unadjusted, and 13 variants of PCA, SVA, PEER, and HCP (Table 1). The details of Simulation Design 1 and Simulation Design 2 are described in Sections S2 and S3, respectively. Recall that in each simulated data set, Y denotes the gene expression matrix (n × p , sample by gene), X 1 denotes the known covariate matrix (n × K 1 , sample by covariate), and X 2 denotes the hidden covariate matrix (n × K 2 , sample by covariate). The genotype information is stored in S in Simulation Design 1 and SArray in Simulation Design 2. In this work, we use K to denote the number of inferred covariates, which are called PCs, SVs, PEER factors, and HCPs in PCA, SVA, PEER, and HCP, respectively.
Given a simulated data set, each of the 15 methods consists of two steps: hidden variable inference step (not applicable for Ideal and Unadjusted) and QTL step. In the hidden variable inference step, we run PCA, SVA, PEER, or HCP to obtain the inferred covariates (and the expression residuals in the case of PEER; Figure 1). In the QTL step, given a gene-SNP pair, we run a linear regression with the gene expression vector (or the residual vector from PEER) as the response and the genotype vector and covariates as predictors, where the choice of the response and covariates depends on the method (Table 1); thus we obtain the p-value for the null hypothesis that the coefficient corresponding to the genotype vector is zero given the covariates.
In Simulation Design 1, we investigate the association between each gene's expression level and each SNP in the entire genome for a simultaneous detection of cis-QTL and trans-QTL relations.
In Simulation Design 2, we investigate the association between each gene's expression level and each of the gene's local common SNPs for a cis-QTL analysis.
For Ideal, we assume that X 2 is known. Therefore, we use X 1 and X 2 as covariates in the QTL step. For Unadjusted, we use X 1 as the covariates.
We devise two ways to use PCA to account for the hidden covariates. For PCA direct screeK, we run PCA on Y directly. For PCA resid screeK, we first residualize Y against X 1 and then run PCA on the residual matrix. In this work, PCA is run with centering and scaling unless otherwise specified; given A, an n × p 1 matrix, and B, an n × p 2 matrix, both observation by feature, to residualize A against B means to take each column of A, regress it against B, and replace the original column of A with the residuals from the linear regression. For both methods, since the scree plots always suggest the true "number of hidden covariates" (K 1 + K 2 for PCA direct screeK, K 2 for PCA resid screeK) as the reasonable number of PCs to choose within plus or minus one (usually exactly; Figure S1), we set the number of PCs to be the true "number of hidden covariates". For PCA direct screeK, we filter out the known covariates that are captured well by the top PCs (unadjusted R 2 ≥ 0.9) and use the remaining known covariates along with the top PCs as covariates in the QTL step. For PCA resid screeK, no filtering is needed.
Here we describe the two hidden variable inference methods for SVA: SVA trueK and SVA BE.
Since the SVA package [4] requires the user to input at least one variable of interest ( Figure 1) and using too many variables of interest causes the package to fail, when running SVA, we input the top PC of the genotype matrix (S in Simulation Design 1, collapsed version of SArray in Simulation Design 2) as the variable of interest. We also input X 1 as the known covariates because the package documentation indicates that the known covariates should be provided if available. The SVA package allows the user to specify K. Alternatively, it can automatically choose K using a slightly modified version of the Buja and Eyuboglu (BE) algorithm [5,6]. Therefore, in SVA trueK, we set K = K 2 , and in SVA BE, we let the package choose K automatically. In both cases, we use X 1 and the surrogate variables (SVs) as covariates in the QTL step.
There are several different ways to use PEER [7] but no consensus in the literature on which one is the best. In the hidden variable inference step, PEER can be run with or without the known covariates when there are known covariates available (Stegle et al. [7] do not give an explicit recommendation as to which approach should be used, and both approaches are used in practice [2,[8][9][10]), and K has to be specified by the user (Stegle et al. [1,7] claim that the performance of PEER does not deteriorate as K increases). In the QTL step, one can include the PEER factors as covariates (we call this the "factor approach") or use the expression residuals outputted by PEER as the response (and not use any known or inferred covariates; we call this the "residual approach"). For completeness, we compare 2 3 = 8 ways of using PEER (the default priors are always used): PEER is run with or without the known covariates; PEER is run using the true "number of hidden covariates" (K 1 + K 2 when PEER is run without the known covariates, K 2 when PEER is run with the known covariates) or using a large K (K=50); and either the factor approach or the residual approach is used in the QTL step.
The HCP package requires the user to specify K and three tuning parameters: λ 1 , λ 2 , and λ 3 (Section S5.2). The package documentation suggests choosing K and the tuning parameters via a grid search. However, no specific recommendations are given regarding the choice of the score function. In practice, users of HCP often choose K and the tuning parameters by maximizing the number of discoveries [11,12]. For our simulation studies, such an approach would be computationally prohibitive. Therefore, for simplicity, we set K = K 2 and λ 1 = λ 2 = λ 3 = 1; the latter is because we do not want to give more weight to the penalty terms than the main term in the objective function (Section S5.2).  Table S3: Summary of QTL analyses performed by GTEx [2,8] and Li et al. [9]. "INT" in (D) stands for "inverse normal transform" [13]. (E), (F), and (G) summarize how PEER is used (Section S4) in each study. GTEx [2,8] chooses the number of PEER factors for its eQTL analyses (including cis and trans) by maximizing the number of discovered cis-eGenes for each pre-defined sample size bin. The number of PEER factors selected is 15 for n < 150, 30 for n ∈ [150, 250), and 35 for n ≥ 250 for GTEx V6p eQTL analyses [8] and 15 for n < 150, 30 for n ∈ [150, 250), 45 for n ∈ [250, 350), and 60 for n ≥ 350 for GTEx V8 eQTL analyses [2], where n denotes the sample size. Li et al. [9] use the numbers of PEER factors chosen by GTEx [8].  Figure S5: In the 3 ′ aQTL data prepared by Li et al. [9] from GTEx RNA-seq reads [8], PEER factors fail to capture important variance components of the molecular phenotype data when the data transformation method is "No transformation" or "INT within sample" (a; the numbers of PCs are chosen via BE (Algorithm S3)). On the other hand, PEER factors span roughly the same linear subspace as the top PCs when the data transformation method is "Center and scale" or "INT within feature", but the top PCs can almost always capture the PEER factors better than the PEER factors can capture the top PCs (b; the numbers of PCs are equal to the numbers of PEER factors). Given m PEER factors and n PCs from the same post-transformation molecular phenotype matrix (m ≥ n in a, m = n in b), we calculate m adjusted R 2 's by regressing each PEER factor against the PCs and plot the average in blue. Similarly, we calculate n adjusted R 2 's by regressing each PC against the PEER factors and plot the average in orange.

Reference
Algorithm S1: Reordering of PEER factors based on PCs ( Figure 5).
Select the PEER factor that is the most highly correlated with the kth PC from the PEER factors that have not been selected yet. 3 end 4 return the PEER factors in the order that they were selected in.    Figure S6: In GTEx eQTL data [2], PEER is at least three orders of magnitude slower than PCA (a), and replacing the PEER factors with PCs in GTEx's FastQTL pipeline [2,14] does not change the cis-eQTL results much (b, c, d) . The x-axis shows 10 randomly selected tissue types with increasing sample sizes. a For a given gene expression matrix, running PEER without the known covariates (GTEx's approach) takes up to about 1,900 minutes (equivalent to about 32 hours; Whole Blood), while running PCA (with centering and scaling; our approach) takes no more than a minute. For comparison, we also run PEER with the known covariates using the numbers of PEER factors selected by GTEx. This approach takes even longer (up to about 4,600 minutes, equivalent to about 77 hours; Esophagus -Mucosa). b The p-values produced by GTEx's approach and our approach are highly correlated (correlations between the negative common logarithms are shown). c, d The overlap of the identified eGenes and eQTL pairs between the two approaches is generally around 90% (see Figure S7 for more detail).   Figure S8: Joint analysis of results from Simulation Design 1, Simulation Design 2, and GTEx eQTL data [2]. The methods compared in a through d are PCA direct screeK and PEER noCov trueK factors. e presents similar information as Figure 5; the total count is 49, which is the number of tissue types with GTEx eQTL analyses. f is based on Figure S6(d); the total count is 10, which is the number of tissue types randomly selected for analysis in Figure S6. We find that the percentage of QTL discoveries shared is a good predictor of the relative performance of PEER versus PCA and is a better predictor than concordance. This plot is also evidence that Simulation Design 2 is more realistic than Simulation Design 1 because the ranges that concordance and percentage of QTL discoveries shared fall in in e and f agree better with those in c and d than those in a and b.
S5 Theory of PCA and HCP

S5.1 Principal component analysis (PCA)
Principal component analysis (PCA) [15,16] is a well-established dimension reduction method with many applications. Here we aim to provide a brief summary of its algorithm, derivation, and interpretation.
Let X denote the n × p observed data matrix that is observation by feature, e.g., a molecular phenotype matrix. We use X instead of Y here to be consistent with standard PCA notations. We assume that the columns of X have been centered and scaled. That is, X satisfies and where x i j denotes the i j-th entry of X.
The PCA algorithm consists of two steps. In the first step, we calculate the sample covariance matrix Σ and perform eigendecomposition on it: Σ = 1 n X ⊤ X definition of sample covariance matrix (S12) is an orthogonal matrix whose columns are eigenvectors of Σ, and is a diagonal matrix whose diagonal entries are the corresponding eigenvalues of Σ. We know that Σ is orthogonally diagonalizable because it is a symmetric matrix (recall the spectral theorem for real matrices [17]: a real matrix is orthogonally diagonalizable if and only if it is symmetric). The eigenvalues are all non-negative because Σ is positive semidefinite.
In the second step, we calculate Z as where the columns of Z are called the principal components (PCs) or scores, and Q is called the loading matrix or rotation matrix. It is worth noting that some authors may refer to q 1 , · · · , q p as the PCs. This use of terminology is confusing and should be avoided [18].
The above two steps conclude the PCA algorithm. In practice, however, singular value decomposition (SVD) of the data matrix is often used as a more computationally efficient way of finding the loading matrix and the PCs [15]. function where Q K denotes an arbitrary p × K matrix whose columns are orthonormal, |||·||| F denotes the Frobenius norm of a matrix, and x ⊤ i denotes the ith row of X. Since Q K Q ⊤ K x i represents the (orthogonal) projection of x i onto the subspace spanned by the columns of Q K , (S22) measures the total squared ℓ 2 error when approximating each x i with its projection onto the subspace spanned by the columns of Q K .
A central idea of PCA is the proportion of variance explained by each PC. To establish this concept, we claim that Var Z j = λ j , j = 1, · · · , p , and Cov Z j , Z j ′ = 0 , j, j ′ = 1, · · · , p , j ̸ = j ′ , where X j denotes the jth column of X (the jth original variable) and Z j denotes the jth column of Z (the jth PC). (S25) means that the PCs are uncorrelated with each other.
We prove (S24) and (S25) by calculating Σ Z , the sample covariance matrix of Z (we know that the columns of Z are centered by (S10) and (S16)): Because of (S23) and (S24), we may define the proportion of variance in the original data explained by the jth PC as which provides a basis for deciding the number of PCs to keep (e.g., Algorithms S2 and S3).

S5.2 Hidden covariates with prior (HCP) and its connection to PCA
Hidden covariates with prior (HCP) [21] is a popular hidden variable inference method for QTL mapping defined by minimizing a loss function. Neither Mostafavi et al. [21] nor the HCP package documents the HCP method well. For example, the squares in the loss function (S37) are missing in both Mostafavi et al. [21] and the package documentation, but one can deduce that the squares are there by inspecting the coordinate descent steps in the source code of the R package. Here we aim to provide a better, more accurate documentation of the HCP method and point out its connection to PCA.
Given Y , the molecular phenotype matrix (n × p, sample by feature), X 1 , the known covariate matrix (n × K 1 , sample by covariate), K, the number of inferred covariates (HCPs), and λ 1 , λ 2 , λ 3 > 0, the tuning parameters, HCP looks for arg min where |||·||| F denotes the Frobenius norm of a matrix, X 2 is the hidden covariate matrix, and W 1 and W 2 are weight matrices of the appropriate dimensions. The name of the method, "hidden covariates with prior", comes from the second term in (S37), where the method informs the hidden covariates with the known covariates. The optimization is done through coordinate descent with one deterministic initialization (see source code of the HCP R package [21]). The columns of the obtained X 2 are reported as the HCPs.
From (S37), we see that the HCP method is closely related to PCA. The first term in (S37) is very similar to (S21), the only difference being that the rows of W 2 in (S37) are not required to be orthonormal and X 2 is not required to be equal to YW ⊤ 2 .
Algorithm S2: The elbow method for choosing K in PCA (based on distance to diagonal line).
Input: X, n × p observed data matrix, observation by feature.
Output: K, the number of PCs selected.